Loop handling for single instruction multiple datapath processor architectures

ABSTRACT

A method of controlling the enabling of processor datapaths in a SIMD processor during a loop processing operation is described. The information used by the method includes an allocation between the data items and a memory, a size of the array, and a number of remaining parallel passes of the datapaths in the loop processing operation. A computer instruction is also provided, which includes a loop handling instruction that specifies the enabling of one of a plurality of processor datapaths during processing of an array of data items. The instruction includes a parallel count field that specifies the number of remaining parallel loop passes to process the array and a serial count field that specifies the number of serial loop passes to process the array. Different instructions can be used to handle different allocations of passes to parallel datapaths. The instruction also uses information about the total number of datapaths.

TECHNICAL FIELD

[0001] This invention relates to loop handling operations over an array of data items in a single instruction multiple datapath (SIMD) processor architecture.

BACKGROUND

[0002] Parallel processing is an efficient way of processing an array of data items. A SIMD processor is a parallel processor array architecture wherein multiple datapaths are controlled by a single instruction. Each datapath handles one data item at a given time. In a simple example, in a SIMD processor having four datapaths, the data items in an eight data item array would be processed in each of the four datapaths in two passes of a loop operation. The allocation between datapaths and data items may vary, but in one approach, in a first pass the first data item in the array is processed by a first datapath, a second data item in the array is processed by a second datapath, a third data item is processed by a third datapath, and a fourth data item is processed by a fourth datapath. In a second pass, a fifth data item is processed by the first datapath, a sixth data item is processed by the second datapath, a seventh data item is processed by the third datapath, and an eighth data item is processed by the fourth datapath.

[0003] Problems may occur when the number of data items in the array is not an integer multiple of the number of datapaths. For example, modifying the simple example above so that there are four datapaths and an array having seven data items, during the second pass the fourth datapath has no eighth data item in the array to process. As a result, the fourth datapath may erroneously write over some other data structure in memory, unless the fourth datapath is disabled during the second pass.

[0004] One way of avoiding such erroneous overwriting is to force the size of the array, i.e., the number of data items contained within the array, to be an integer multiple of the number of datapaths. Such an approach assumes that programmers have a priori control of how data items are allocated in the array, which they may not always have.

[0005] Typically, each datapath in a SIMD processor has an associated processor enable bit that controls whether a datapath is enabled or disabled. This allows a datapath to be disabled when, e.g., the datapath would otherwise overrun the array.

SUMMARY

[0006] In a general aspect, the invention features a method of controlling whether to enable one of a plurality of processor datapaths in a SIMD processor that are operating on data elements in an array, including determining whether to enable the datapath based on information about parameters of the SIMD processor and the array, and a processing state of the datapaths relative to the data items in the array.

[0007] In a preferred embodiment, the information includes an allocation between the data items and a memory, a total number of parallel loop passes in a loop processing operation being performed by the datapaths, a size of the array, and a number of datapaths (i.e., how many datapaths there are in the SIMD processor). The processing state is a number of remaining parallel passes of the datapaths in the loop processing operation.

[0008] The allocation between the data items and the memory may be unity-stride, contiguous or striped-stride.

[0009] In another aspect, the invention features a computer instruction including a loop handling instruction that specifies the enabling of one of a plurality of processor datapaths during processing an array of data items.

[0010] In a preferred embodiment, the instruction includes a parallel count field that specifies the number of remaining parallel loop passes to process the array, and a serial count field that specifies the number of serial loop passes to process the array.

[0011] In another aspect, the invention features a processor including a register file and an arithmetic logic unit coupled to the register file, and a program control store that stores a loop handling instruction that causes the processor to enable one of a plurality of processor datapaths during processing of an array of data.

[0012] Embodiments of various aspects of the invention may have one or more of the following advantages.

[0013] Datapaths may be disabled without having prior knowledge of the number of data items in the array.

[0014] The method is readily extensible to a variety of memory allocation schemes.

[0015] The loop handling instruction saves instruction memory because the many operations needed to determine whether to enable or disable a datapath may be specified with a simple and powerful single instruction that also saves register space.

[0016] The loop handling instruction saves a programmer from having to force the number of data items in the array of data items to be an integer multiple of the number of datapaths.

[0017] Other features and advantages of the invention will be apparent from the following detailed description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

[0018] FIG. 1 is a block diagram of a single instruction multiple datapath (SIMD) processor.

[0019] FIG. 2 shows a table of how thirty data items in an array are handled by a SIMD processor having four datapaths during loop processing in a unity stride allocation of memory.

[0020] FIG. 3 shows the syntax of a loop handling instruction.

[0021] FIG. 4 shows a table of how thirty data items in an array are handled by a SIMD processor having four datapaths during loop processing in a contiguous stride allocation of memory.

[0022] FIG. 5 shows the syntax of a loop handling instruction combined with a loop branch.

[0023] FIG. 6 is a flow diagram of a process of controlling the enabling of datapaths in a SIMD processor during loop processing.

[0024] Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

[0025] Referring to FIG. 1, a single instruction multiple datapath (SIMD) processor 10 includes an instruction cache 12, control logic 14, a serial datapath 16, and a number of parallel datapaths labeled 18a, 18b, 18c, . . . 18n. The parallel datapaths 18 write to a memory 20. Each of the datapaths 18 has an associated processor enable (PE) bit 22. Specifically, parallel datapath 18a is associated with a PE bit 22a, parallel datapath 18b is associated with a PE bit 22b, and so forth. When a PE bit is enabled, its associated parallel datapath is enabled and data items may be written by that parallel datapath. For example, if PE bit 22a is enabled, data items may be written by parallel datapath 18a; if PE bit 22b is enabled, data items may be written by parallel datapath 18b. If PE bit 22n is enabled, data items may be written by parallel datapath 18n. When a PE bit is disabled, its associated parallel datapath is disabled and data items may not be written by that parallel datapath.

[0026] In operation, the control logic 14 fetches an instruction from the instruction cache 12. The instruction is fed to the serial datapath 16, which provides the instruction to the datapaths 18. The datapaths 18 are read together and written together unless the processor enable bit is disabled for a particular datapath.

[0027] One or more of the datapaths 18 may need to be disabled during a loop processing operation of an array of data items to prevent an unused datapath from overrunning the end of the array and erroneously writing over another data structure in memory. Rather than manually having to determine when during the loop processing operation to enable and disable datapaths, this determination may be made on the fly during the loop processing operation, based on information about parameters of the SIMD processor and the array, and the processing state of the datapaths relative to the data items in the array. This information includes: (1) the total number of parallel loop passes occurring in the loop processing operation, (2) the number of loop passes that would execute in a serial datapath design (which indicates the size of the array), (3) the number of remaining parallel passes occurring in the loop processing operation, (4) the memory allocation used to allocate data items of the array among the datapaths, and (5) the number of parallel datapaths. Instructions that enable or disable a processor enable bit for a datapath (thereby enabling or disabling the datapath) during loop processing based on this information are provided.

[0028] There are many ways to allocate memory for processing of an array of data items in a SIMD processor. The simplest memory allocation is where each one of a number of datapaths (NDP) takes every NDPth iteration of the loop. This type of memory allocation is called “unity stride.”

[0029] Referring to FIG. 2, for example, a table illustrating how thirty data items numbered 0 to 29 in an array are handled by a SIMD processor having four datapaths labeled DP0, DP1, DP2 and DP3, respectively, during loop processing in a unity stride memory allocation is shown. In order to process the array, eight parallel loop passes are executed. In a parallel loop pass 1, data items 0, 1, 2, and 3 are handled by datapaths 0, 1, 2, and 3. In a parallel loop pass 2, data items 4, 5, 6 and 7 are handled by datapaths 0, 1, 2, and 3. In a final parallel loop pass, parallel loop pass 8, data items 28 and 29 are handled by datapaths 0 and 1 while datapaths 2 and 3 must be disabled to avoid overrunning the array and writing over other data stored in memory.

[0030] The table in FIG. 2 illustrates why this type of memory allocation is referred to as unity-stride. The “stride” between data items being processed in each of the parallel datapaths in any given parallel loop pass is one. That is, the difference between any two data items being processed by adjacent parallel datapaths in a parallel loop pass is one (or unity).
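The unity stride table described with reference to FIG. 2 can be reproduced with a short sketch. The function name `unity_stride_table` and the use of `None` to mark a disabled datapath are illustrative assumptions, not part of the described instruction set.

```python
def unity_stride_table(num_items, num_datapaths):
    """Return one row per parallel loop pass.

    Each row lists the data item handled by DP0, DP1, ... in that pass.
    None marks a datapath that would overrun the array and so must be disabled.
    """
    total_passes = -(-num_items // num_datapaths)  # ceiling division
    table = []
    for p in range(total_passes):
        row = []
        for dp in range(num_datapaths):
            item = p * num_datapaths + dp  # adjacent datapaths differ by one
            row.append(item if item < num_items else None)
        table.append(row)
    return table


table = unity_stride_table(30, 4)
for pass_number, row in enumerate(table, start=1):
    print(pass_number, row)
```

For thirty data items and four datapaths, the last row shows data items 28 and 29 on datapaths 0 and 1, with datapaths 2 and 3 disabled.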

[0031] In the unity stride allocation, as the data items are processed a pattern emerges. Specifically, the pattern illustrates that only two datapaths in a final parallel loop pass need to be disabled. (Obviously, the pattern illustrated in FIG. 2 is trivial; as the number of datapaths and the array size are increased, the pattern becomes more complex, but is discernible in time.) From a knowledge of the pattern, the total number of loop passes that would execute in a serial machine (which indicates the size of the array), the number of remaining parallel loop passes, and the number of datapaths, an instruction is provided to determine whether a particular datapath should be disabled during a particular parallel loop pass.

[0032] Referring to FIG. 3, a loop processor enable instruction 30 includes a field C representing the number of remaining parallel loop passes during a loop processing operation, and a field L representing the overall number of passes needed to service all the data items in an array in a serial machine architecture. The instruction 30 includes a memory allocation designation x. In the example described with reference to FIG. 2, the memory allocation designation x would refer to a unity-stride memory allocation, i.e., U, and L=30 since there are thirty data items that would require thirty loop passes in a serial machine architecture. PE [i, j] represents the state of the processor enable bit for datapath i during parallel loop pass j.

[0033] For the unity-stride example described in reference to FIG. 2, the total number of parallel loop passes is determined by dividing the total number of serial loop passes by the number of datapaths, and rounding the result up to the next integer. Thus, in the example the total number of parallel loop passes equals 30/4, which rounded up to the next integer produces 8.

[0034] Using the knowledge gained from the pattern present in the unity-stride example and the values of C and L, a processor enable bit associated with a datapath index i representing the datapath and a parallel loop pass j, that is, PE [i, j], is enabled if the difference between the total number of parallel loop passes and the number of remaining parallel loop passes, multiplied by the total number of datapaths, plus the datapath index, is less than the total number of serial loop passes.
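The unity-stride enable condition can be written out as a minimal sketch. The function names are hypothetical. The sketch also assumes that the remaining-pass count C includes the pass currently executing (C equals the total on the first pass and 1 on the last), which is the reading under which the condition reproduces the table of FIG. 2.

```python
def total_parallel_passes(serial_passes, num_datapaths):
    """Serial loop passes divided by the number of datapaths, rounded up."""
    return -(-serial_passes // num_datapaths)


def pe_unity(i, remaining, serial_passes, num_datapaths):
    """Enable test for datapath index i under unity-stride allocation.

    remaining is the field C and serial_passes is the field L.
    Assumption: remaining counts the current pass as well.
    """
    total = total_parallel_passes(serial_passes, num_datapaths)
    return (total - remaining) * num_datapaths + i < serial_passes


# Final pass of the FIG. 2 example: L = 30, four datapaths, C = 1.
print([pe_unity(i, 1, 30, 4) for i in range(4)])
```

In the final pass the left-hand side evaluates to 28, 29, 30 and 31 for datapaths 0 to 3, so only datapaths 0 and 1 satisfy the condition.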

[0035] Alternatively, SIMD processor 10 may use a contiguous stride memory allocation. Referring to FIG. 4, a table illustrating how thirty data items (0 to 29) in an array are handled by SIMD processor 10 having four datapaths (DP0-DP3) and implementing a contiguous stride memory allocation is shown. In order to process all thirty data items in the array, eight parallel passes are executed. In a parallel loop pass 1, data items 0, 8, 16 and 24 are handled by datapaths 0, 1, 2 and 3, respectively. In parallel loop pass 2, data items 1, 9, 17 and 25 are handled by datapaths 0, 1, 2 and 3. As processing continues, a pattern arises. In this specific example, in parallel loop passes 7 and 8, datapath 3 needs to be disabled to avoid writing over memory beyond the end of the thirty data items in the array. All other datapaths are enabled in every pass.

[0036] The contiguous-stride memory allocation is useful when neighboring data items are used when working on a particular data item. For example, if datapath 0 is processing data item 4 in parallel loop pass 5, it already has data item 3 from parallel loop pass 4 and will be using data item 5 on the next parallel loop pass. This memory allocation is called contiguous stride allocation because each datapath is accessing a contiguous region of the array.

[0037] In the contiguous stride memory allocation, a pattern emerges to suggest that a single datapath needs to be disabled during executions of, in this example, the last two parallel loop passes. Referring again to FIG. 3, a memory allocation designation x=CONT represents a contiguous-stride memory allocation scheme. For the example described with reference to FIG. 4, the total number of parallel loop passes needed to process the array of data items is determined by dividing the total number of serial loop passes by the number of datapaths and rounding the result up to the next integer. Thus, in the example, the total number of parallel loop passes equals 30/4, rounded up to 8.

[0038] From the contiguous-stride memory allocation pattern and the values of C and L, a processor enable bit associated with a datapath index i and a parallel loop pass j, that is, PE [i, j], is enabled if the total number of parallel loop passes multiplied by the datapath index, plus the difference between the total number of parallel loop passes and the number of remaining parallel loop passes, is less than the total number of serial loop passes.
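As an illustrative sketch, the contiguous-stride condition can be checked against the FIG. 4 example. The function name `pe_contiguous` is hypothetical, and the sketch again assumes that the remaining-pass count C includes the pass currently executing.

```python
def pe_contiguous(i, remaining, serial_passes, num_datapaths):
    """Enable test for datapath index i under contiguous-stride allocation.

    Each datapath owns a contiguous block of `total` items, so the item it
    touches in a given pass is total * i plus the number of completed passes.
    Assumption: remaining (field C) counts the current pass as well.
    """
    total = -(-serial_passes // num_datapaths)  # ceiling division
    return total * i + (total - remaining) < serial_passes


# FIG. 4 example: L = 30, four datapaths, eight parallel passes.
for pass_number in range(1, 9):
    remaining = 8 - pass_number + 1
    print(pass_number, [pe_contiguous(i, remaining, 30, 4) for i in range(4)])
```

For datapath 3 the item index reaches 30 in pass 7 and 31 in pass 8, so the condition fails in exactly those two passes.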

[0039] An interleaved memory system permits many memory accesses to be done at once. The number of memory banks M in an interleaved memory system is generally a power of two, since that allows the memory bank selection to be made using the lowest address bits. If the stride in a read or write instruction is also a power of two, the memory interleaving will not help, since all the addresses will try to access the same memory bank. For example, if M=4 and the stride is also four, the addresses for the read or write would be 0, 4, 8, and so forth, and they would all have to be handled by bank 0; banks 1, 2 and 3 would be idle.

[0040] To avoid having all of the data items processed in the same memory bank, the stride value may be selected to be an odd number. Selecting the stride to be an odd number spreads the addresses evenly among M banks if M is a power of two, since any odd number and any power of two are mutually prime. In the case of a 30 element array, the stride would be 9, not 8 as with the contiguous allocation. Datapath 0 would correspond to array elements 0 to 8, datapath 1 would be associated with array elements 9 to 17, datapath 2 would correspond to elements 18 to 26, and datapath 3 would be assigned to elements 27 to 29. Datapath 3 would be turned off for the last six elements, i.e., array elements 30 to 35. This memory allocation is referred to as a striped-stride memory allocation.
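The effect of the stride on bank usage can be illustrated with a small sketch. It assumes the simplest interleaving, in which the bank is the address modulo M (that is, the lowest address bits), and the function name `bank_usage` is hypothetical.

```python
from collections import Counter


def bank_usage(stride, num_banks, num_accesses):
    """Count how many of the strided addresses 0, stride, 2*stride, ... hit each bank.

    Assumption: bank index = address mod num_banks (lowest address bits).
    """
    return Counter((k * stride) % num_banks for k in range(num_accesses))


print(bank_usage(4, 4, 8))  # power-of-two stride: every access lands in bank 0
print(bank_usage(9, 4, 8))  # odd stride: accesses are spread over all four banks
```

With a stride of four every address is a multiple of four, so only bank 0 is used. With a stride of nine the addresses cycle through all four banks.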

[0041] The number of parallel loop passes needed to process an array of data items in a striped-stride memory allocation scheme is determined by dividing the total number of serial loop passes by the number of datapaths and rounding the result up to the next odd integer. Thus, for the 30 element array and four datapaths, 30/4 is rounded up to 9.

[0042] Referring again to FIG. 3, a memory designation x=S represents striped-stride allocation. A processor enable bit associated with a datapath i and a parallel loop pass j, that is, PE [i, j], is enabled if the total number of parallel loop passes times the datapath index, plus the difference between the total number of parallel loop passes and the number of remaining parallel loop passes, is less than the total number of serial loop passes.
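A minimal sketch of the striped-stride case follows. It combines the rounding to the next odd integer with the enable condition. The function names are hypothetical, and the remaining-pass count C is assumed to include the pass currently executing.

```python
def striped_total_passes(serial_passes, num_datapaths):
    """Serial passes divided by datapaths, rounded up to the next odd integer."""
    passes = -(-serial_passes // num_datapaths)  # round up to an integer first
    return passes if passes % 2 == 1 else passes + 1


def pe_striped(i, remaining, serial_passes, num_datapaths):
    """Enable test for datapath index i under striped-stride allocation.

    Assumption: remaining (field C) counts the current pass as well.
    """
    total = striped_total_passes(serial_passes, num_datapaths)
    return total * i + (total - remaining) < serial_passes


total = striped_total_passes(30, 4)
dp3 = [pe_striped(3, c, 30, 4) for c in range(total, 0, -1)]
print(total, dp3)
```

For the 30 element array this gives nine passes, with datapath 3 enabled for its first three passes (elements 27 to 29) and disabled for the remaining six.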

[0043] Referring to FIG. 5, the loop processor enable instruction is shown combined with a loop branch instruction 70. This combined instruction 70 will set the processor enable bit, as described previously, according to the memory allocation scheme, the overall number of parallel loop passes and the number of remaining parallel loop passes, and test whether the number of remaining parallel loop passes equals zero. If the number of remaining passes is greater than zero, the branch is performed (i.e., “go to PC+displacement”), to perform the next pass of the loop operation. Otherwise, the loop is exited, and processing continues. In either case, the number of remaining parallel loop passes is decremented and the loop processing operation continues.
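The behavior of the combined instruction can be simulated in software as a rough sketch for the unity-stride case. The exact ordering of the decrement, the test and the processor enable update within the instruction is not spelled out above, so the ordering used here (decrement, test, then set the enable bits for the next pass) is an assumption, as is the function name `run_unity_stride_loop`.

```python
def run_unity_stride_loop(serial_passes, num_datapaths):
    """Simulate a loop closed by a combined enable-and-branch step.

    Returns the data item indices actually processed, in order.
    Assumed ordering at the end of each pass: decrement the remaining count,
    exit if it reached zero, otherwise set the PE bits for the next pass.
    """
    total = -(-serial_passes // num_datapaths)  # ceiling division
    if total == 0:
        return []
    remaining = total
    processed = []

    def set_pe_bits():
        # Unity-stride enable condition for every datapath index i.
        return [(total - remaining) * num_datapaths + i < serial_passes
                for i in range(num_datapaths)]

    pe = set_pe_bits()
    while True:
        # Loop body: only enabled datapaths touch the array.
        for i in range(num_datapaths):
            if pe[i]:
                processed.append((total - remaining) * num_datapaths + i)
        # Combined instruction: decrement, test, and branch back if work remains.
        remaining -= 1
        if remaining == 0:
            break
        pe = set_pe_bits()
    return processed


print(run_unity_stride_loop(30, 4))
```

Under these assumptions every data item from 0 to 29 is processed exactly once and no index beyond the end of the array is touched.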

[0044] Referring to FIG. 6, a process 100 of controlling the enabling of a datapath in a SIMD processor during loop processing determines 102 the number of serial loop passes to service all of the data items in an array. The process determines 104 the number of remaining parallel loop passes to service the array. The process then tests 106 whether the memory allocation scheme is a unity stride allocation. If the memory allocation is a unity stride allocation, the processor enable bit for the datapath servicing the data item is enabled 108 if the difference between the total number of parallel loop passes and the number of remaining parallel loop passes, multiplied by the total number of datapaths, plus the datapath index, is less than the total number of serial loop passes.

[0045] If the memory allocation is not unity stride, the process tests 110 whether the memory allocation scheme is a contiguous stride allocation. If the memory allocation is a contiguous stride allocation, the processor enable bit for the datapath servicing the data item is enabled 112 if the total number of parallel loop passes multiplied by the datapath index, plus the difference between the total number of parallel loop passes and the number of remaining parallel loop passes, is less than the total number of serial loop passes.

[0046] Finally, if the memory allocation is neither unity stride nor contiguous, the process tests 114 whether the memory allocation scheme is a striped stride allocation. If the memory allocation is a striped stride allocation, the processor enable bit for the datapath servicing the data item is enabled 116 if the total number of parallel loop passes times the datapath index, plus the difference between the total number of parallel loop passes and the number of remaining parallel loop passes, is less than the total number of serial loop passes.
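The decision flow of FIG. 6 can be summarized in one dispatch function, offered here as a sketch rather than a definitive implementation. The designation strings "U", "CONT" and "S" follow the memory allocation designations used with FIG. 3. The function names, and the assumption that the remaining-pass count includes the current pass, are illustrative.

```python
def total_passes(scheme, serial_passes, num_datapaths):
    """Total parallel loop passes for a given allocation designation."""
    passes = -(-serial_passes // num_datapaths)  # round up to the next integer
    if scheme == "S" and passes % 2 == 0:
        passes += 1  # striped stride rounds up to the next odd integer
    return passes


def pe_enabled(scheme, i, remaining, serial_passes, num_datapaths):
    """Follow the tests 106, 110 and 114 in order and apply the matching condition."""
    total = total_passes(scheme, serial_passes, num_datapaths)
    done = total - remaining  # assumption: remaining counts the current pass
    if scheme == "U":
        return done * num_datapaths + i < serial_passes
    if scheme in ("CONT", "S"):
        return total * i + done < serial_passes
    raise ValueError(f"unknown memory allocation designation: {scheme}")


def enabled_count(scheme, serial_passes, num_datapaths):
    """Number of (datapath, pass) slots in which the PE bit is set."""
    total = total_passes(scheme, serial_passes, num_datapaths)
    return sum(pe_enabled(scheme, i, c, serial_passes, num_datapaths)
               for i in range(num_datapaths) for c in range(1, total + 1))


for scheme in ("U", "CONT", "S"):
    print(scheme, enabled_count(scheme, 30, 4))
```

A useful sanity check is that, for each scheme, the number of enabled slots equals the number of data items, so each item is serviced once and nothing past the end of the array is written.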

[0047] A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, for processing larger numbers of data items, a lookup table could be utilized until a time at which a pattern develops according to the memory allocation scheme employed. Once the pattern develops, the enabling of datapaths is determined by the method herein described. Accordingly, other embodiments are within the scope of the following claims.

What is claimed is:
 1. A method of controlling whether to enable one of a plurality of processor datapaths in a SIMD processor that are operating on data elements in an array, comprising: determining whether to enable the datapath based on information about parameters of the SIMD processor and the array, and a processing state of the datapaths relative to the data items in the array.
 2. The method of claim 1 wherein the information includes an allocation between the data items and a memory.
 3. The method of claim 2 wherein the information includes whether the allocation is unity-stride, contiguous, or striped stride.
 4. The method of claim 1 wherein the information includes a total number of parallel loop passes in a loop processing operation being performed by the datapaths.
 5. The method of claim 1 wherein the information indicates a size of the array.
 6. The method of claim 1 wherein the processing state is a number of remaining parallel loop passes in the loop processing operation.
 7. The method of claim 1 wherein the information includes a number of said processor datapaths.
 8. The method of claim 1 wherein the information includes an allocation between the data items and a memory, a total number of parallel loop passes in a loop processing operation being performed by the datapaths, a size of the array, a number of remaining parallel passes of the datapaths in the loop processing operation, and a number of said processor datapaths.
 9. The method of claim 8 wherein the allocation between the data items and the memory is a unity-stride.
 10. The method of claim 9 wherein the total number of loop passes is determined by dividing a total number of serial loop passes by the total number of datapaths implemented and rounded up to a next integer.
 11. The method of claim 10 wherein enabling comprises: determining whether the total number of parallel loop passes minus the number of remaining loop passes multiplied by the total number of datapaths implemented plus a datapath number is less than the total number of serial loop passes.
 12. The method of claim 8 wherein the allocation between the data items and the memory is a contiguous stride.
 13. The method of claim 12 wherein the total number of parallel loop passes is determined by dividing the total number of serial loop passes by the number of datapaths and rounded up to a next integer.
 14. The method of claim 13 wherein enabling comprises: determining whether the total number of parallel loop passes multiplied by a datapath number plus the total number of parallel loop passes minus a number of remaining parallel loop passes is less than the total number of serial loop passes.
 15. The method of claim 8 wherein the allocation is a striped stride.
 16. The method of claim 15 wherein enabling comprises: determining whether the total number of parallel loop passes times a datapath number plus the total number of parallel loop passes minus a number of remaining parallel loop passes is less than the total number of serial loop passes.
 17. A computer instruction comprising: a loop handling instruction that specifies the enabling of one of a plurality of processor datapaths during processing an array of data items.
 18. The instruction of claim 17 further comprising: a parallel count field that specifies the number of remaining parallel loop passes to process the array.
 19. The instruction of claim 17 further comprising: a serial count field that specifies the number of serial loop passes to process the array.
 20. A processor comprising: a register file; an arithmetic logic unit coupled to the register file and a program control store that stores a loop handling instruction that causes the processor to enable one of a plurality of processor datapaths during processing of an array of data.